Tenant-specific Optimization in Tectonic
Learn how to perform tenant-specific optimizations in Tectonic.
In the previous lesson, we discussed how we used multitenancy to fairly and efficiently share IOPS and storage capacity. Earlier tenants were using different strategies to store data reliably. Some were using full data replication for speedy writes and reads, while others were using the Reed-Solomon encoding to reduce storage needs, but at the cost of added latency (that will be needed to encode data while writing and decoding data while reading).
We allowed multiple tenants with various workload characteristics and performance requirements to work on the same shared storage. We’ll enable tenants to request their required storage mechanism via the Client Library that we discussed earlier in our design.
Overview#
Now, we consider two tenants, data warehousing and blob store, as our examples to explain the specific storage consumption or latency-related optimizations. The following are the two ways to perform tenant-specific optimization for low latency and storage efficiency.
-
Optimizing writes on data warehouse: We need to optimize how large-scale data can be stored using full-block operations. Since the data is large, we can’t use partial block operations because, in such use cases, partial blocks will increase the latency and decrease storage efficiency.
-
Optimizing blob storage: We need to optimally store the small-scale data (blobs) as well, where blob storage comes in. Since the data is not at a large scale, we’ll perform partial-block operations on both hot and warm blobs.
The following illustration shows the summary of the optimizations for both tenants.
Optimizing writes on data warehouse#
Writing data once and reading it many times later is a dominant pattern in workloads for data warehouses. The file is only accessible to readers for certain workloads after the file is closed by its creator. As a result, the file becomes immutable forever. Since the data can only be read once the file creator is done writing, we prioritize write requests with low latency over the low latency append requests.
Since we will perform write operations on a large amount of data, these write operations will be on the full block. To optimize such write operations, we have used the following two ways:
-
Reed Solomon(RS)-encoded asynchronous writes: This uses write-once-read-many for optimal network, storage, and IO performance.
-
Hedged quorum writes: This is for generating reservation requests to decrease latency.
RS-encoded asynchronous writes#
The write-once-read-many design pattern is used by Tectonic to decrease the overall file write time while increasing IO and network performance. Applications can buffer writes up to the block size because this approach doesn’t involve partial file reads. The blocks are subsequently RS-encoded by applications, and then the storage nodes store the chunks of data in them, as shown below.
1 of 4
2 of 4
3 of 4
4 of 4
The data that will be lived longer is normally encoded using , while the data that will be lived for a shorter period of time is usually encoded with . In , is the total number of bits after encoding, whereas is the original data bits, and its decoder can perform error correction up to symbols where . In , where and , will be . This means that the encoding for long-lived data with will perform error correction up to symbols anywhere in its basic data unit. In , where and , will be which is . This means that the encoding for short-lived data with will be able to correct about 1.5 corrupted symbols per basic data unit.
Note: It is a well-known bound from the coding theory that for decoding is only possible if , where k was the original message size. is the code words size after encoding and are the parity bits. Some notations use ) (where is the size of the original data, and represents parity bits) to explicitly tell about . We will use in our notation. Usually, it is clear from the context which notation is being used but we should be careful.
Storage space, network bandwidth, and disc IO are all reduced when writing entire blocks that have been RS-encoded. Since less data is written overall on the ChunkStore, the storage, bandwidth, and disc IO are reduced. For a tiny write request, whose quantity of data to be transferred is substantially less than the data replication, more IOPS are required to write chunks to discs in . By employing larger block sizes, the bottleneck for full-block writes is now disc bandwidth rather than IOPS, leading to more effective disc IO.
Applications can write a file’s blocks concurrently and asynchronously using the write-once-read-many technique, greatly reducing file write latency. After the file’s blocks have been written, the file’s metadata is completely updated. Since a file is only accessible when it has been entirely written, there is no chance of inconsistent behavior using this approach.
Hedged quorum writes#
Hedged quorum writes are used to improve tail latency. With no additional IO required, Tectonic’s full-block writes use a type of quorum for write operations that decrease tail latency. Rather than sending the chunk to write further on the storage nodes, Tectonic only sends reservation requests. By doing so, the nodes which accept the request of reservation first will perform that write operation by storing the chunks. Such behavior of reservation is closer to hedging. However, the reservation prevents sending data to such nodes, which will reject the write request due to a shortage of resources or because the requester has used more resources on that node than it was allocated.
For example, if the Client Library needs to write to three storage nodes, it can send the request to five nodes and can pick the first three that respond first.
1 of 4
2 of 4
3 of 4
4 of 4
With a highly loaded cluster, the hedging method becomes more effective. In an empirical study, after applying hedged quorum in a test cluster with % of throughput utilization, they experienced approximately % improvement in the full-block write and th percentile latency for -encoding.
Optimizing blob storage#
With the increase of objects, it becomes challenging to index all the objects in blob storage. Log-structured files are used to store the blobs by appending them to the file. Blobs are located using a map that links the blobID to the blob’s position in the file. Since the blobs are smaller in size, less than block size , we’ll perform partial-block operations for blobs rather than full-block operations.
To optimize blob storage, we need to optimize how hot (blobs that are actively read) and warm (blobs on which read activity has substantially decreased) blobs are being stored. We can do so in the following two ways:
- Consistent appends on partial blocks: They help decrease the latency of hot blobs since they are needed to be accessed more frequently. Since this data is being more frequently accessed, it is better to get it fully replicated consistently.
- Re-encoding blocks: They help with the compression of warm blobs, which will rarely be accessed. Since this data is rarely accessed, it is better to get it re-encoded and rewritten using RS to achieve maximum compression.
Consistent appends on partial blocks#
Blob storage also entertains multiple user requests, so it is better to have low latency. The size of a blob is smaller than the block size , so one block can store multiple blobs. Whenever these small blobs come for write, they are written on the partial block, and we perform replication on that partial block append for the low latency. To make our blobs available right after the write operation, we make the partial block append read-after-write consistent. This partial block replication takes more space on the disk than earlier discussed RS-encoding of full-block data.
Partial block quorum appends provide us durability, low-latency, and consistent blob writes. Two scenarios in which the temporary decrease of durability is accepted by the quorum write are as follows:
- The block will be re-encoded soon.
- Another data center keeps the second copy sent by blob storage.
Even partial block quorum appends can cause issues such as incomplete or missed append operations that can cause different variations of the chunk replicas of different sizes if not handled carefully. For example, we perform replication of the append operation performed by different clients A, B, and C. Client A commits bytes of the append operation, but unfortunately, the second replica fails. Similarly, Client B also commits bytes of the append operation, but the first replica fails this time. Client C also commits bytes of the append operation, but the third replica failed this time, as shown below.
1 of 3
2 of 3
3 of 3
Now, all the chunks are of the same size – . However, each copy is different than the other one. The actual size of the append, which is , is different from all the stored chunks. When the reader comes to read the file, expectations would be to read a chunk based on the information provided by the metadata, but none of these chunks have a size of . Giving any chunk to the reader will contain inaccurate and incomplete data.
To avoid this, we have to control who is allowed to append the block and when those appends will be visible by using the following four steps:
- We only allow the writer that has created the block to append that block.
- We commit the updated block size and checksum to block metadata when the append operation is completed before acknowledging the partial block quorum append.
- After the commit, the replica of the block up to size is stored on a minimum of storage nodes.
- The clients with the read request can read the contents of the block up to ( being the offset) when the acknowledgment is generated by the system and will be updated in block metadata.
A single appender with this ordering of operations provides consistency.
The above graphs are conceived from the Tectonic paper.
The fact that Haystack and Tectonic’s blob storage have similar read and write latencies (as shown in the graphs above) confirms that Tectonic’s generality does not significantly affect performance.
Re-encoding blocks#
As we already discussed, applying RS-encoding on full blocks is IO efficient, but small partial block appends are inefficient. Therefore, instead of RS-encoding after each small append, we use three-way replication until a full block of data is written. When a block is full, it is sealed for any further mutation. At that point, our system reads the block from the replicated storage and RS-encodes it for storage efficiency. The number of IOPS used this way is fewer as compared to IOPS that would have been needed had we RS-encoded after each append.
1 of 6
2 of 6
3 of 6
4 of 6
5 of 6
6 of 6
In this lesson, we learned about the optimization of storage (non-ephemeral resources). In the upcoming lesson, we’ll learn about how Tectonic performed during production and the tradeoffs between simplicity and performance.